Dataset: In this notebook I will take use of the
Soccer Databaseto wrangle, explore and investigate questions related to soccer data. This database is a collection of observations of football players, including a lot of statistics on their game play, their teams, matches etc and corresponding attributes in the game FIFA Football. The entire database can be downloaded via https://d17h27t6h515a5.cloudfront.net/topher/2017/November/5a0a4cad_database/database.sqlite. To follow the investigation as in this notebook, download the dataset, store it in the same directory as this file and execute the cells.
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
import os, sys, zipfile
import sqlite3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = "retina"
with zipfile.ZipFile(os.path.join(os.getcwd(), "archive.zip"), "r") as fp:
zipfile.ZipFile.extractall(fp)
db_path = os.path.join(os.getcwd(), "database.sqlite")
def connect_database(db_path):
""" Instantiate a connection to a SQLite database
INPUT: db_path : (string) containing path to database
OUTPUT: Connection to database or error if fail """
connection = None
try:
connection = sqlite3.connect(db_path)
return connection
except sqlite3.Error as e:
print(e)
return connection
connection = connect_database(db_path)
tbl_list = ["Country", "League", "Match", "Player",
"Player_Attributes", "Team", "Team_Attributes"]
def select_all_tbls(connection, tables):
""" Return data from all tables included in the database
INPUT: connection (sqlite3 object) holding a connection to the db
OUTPUT: Data from tables specified in tables """
df_dict = {}
for tbl in tables:
df = pd.read_sql_query("SELECT * FROM {}".format(tbl), connection)
df_dict[tbl] = df
return df_dict
df_dict = select_all_tbls(connection, tables=tbl_list)
General Properties: After extracting the tables specified in
tbl_listinto our current working memory in df_dict (keys = table name, values = dataframes) we can begin our initial exploration. We need to keep in mind, that this way of loading all data into the workspace is computationally not the most efficient way, but as we want to get a first glimpse onto our data and ask what questions it can answer at all, I prefer doing this in Python and then finally just discard the dataframes that we will no longer consider for our final investigations. Another way would be to do a initial exploration in Text Editors first and ultimately just load the data we need into this workspace.
Procedure:
df_dict. df_dict into single dataframes. player_df. player_df only contains some very basic information about the players, like height and weight, birthdate and their database ID. The interesting attributes however can be found in the player_attr_df, so we will next have a quick look at it, including its dimensions, its column names and its data types.# 1
for item, vals in df_dict.items():
print(item, vals.shape)
# 2
import operator as op
country_df, league_df, match_df = op.itemgetter("Country", "League", "Match")(df_dict)
player_df, player_attr_df = op.itemgetter("Player", "Player_Attributes")(df_dict)
team_df, team_attr_df = op.itemgetter("Team", "Team_Attributes")(df_dict)
# 3
player_df.head()
# 3
player_df.describe()
# 3
player_df.info()
# 4
player_attr_df.head()
# 4
player_attr_df.describe()
# 4
player_attr_df.info()
# 5
del (country_df, league_df, match_df, team_df, team_attr_df)
Specific Properties: We now have two dataframes left (
player_dfandplayer_attr_df). We will use these to answer 3 key questions:
- Who are the top football players according to the Soccer Database ?
- Who are the top football players according to the Soccer Database ?
- What makes players good in
penalties?
Procedure:
player_df except from the real players name almost contains no useful information that we want to have a look at. We can use the ID columns to identify matching rows. For sake of saving time while typing store the new dataframe as df, though keep in mind that if we would need to handle many dataframes at once, this is not the best coding practice.df = pd.merge(player_df, player_attr_df, on = ['player_fifa_api_id', 'player_api_id'], how = "inner", suffixes = ["_player_df", "player_attr_df"])
df.head()
df.describe()
df.info()
# 1
print(df.isna().any(1).sum())
print(df.isna().sum().sum())
# 2
for col in df.columns:
print(col, "containing", df[col].isna().sum(), "null values")
# 3
for col in df.columns:
if df[col].dtype == "float64" or df[col].dtype == "int64":
df[col].fillna(df[col].mean(), inplace=True)
print(df.isna().any(1).sum())
print(df.isna().sum().sum())
float64, which is already perfect for doing statistical computings later. The birthdayand date columns however are of type string (object), so convert them to a pandas datetime series. This way, we can have a look at a players evolution over time later and handle these as time series more easily later.# 4
df.dtypes
df["date"] = pd.to_datetime(df.date)
df["birthday"] = pd.to_datetime(df.birthday)
df.dtypes
df.to_csv("processed_football_players.csv", index=False)
Part 1: Now that we have cleaned the dataset, investigated its datatypes and performed some initial statistics, its time to get a glimpse on what we can do with this dataset. The most obvious question now is, who these top players in this dataset are. We will explore it in two ways, once computationally by statistics and second by doing visuals.
Sort the dataframe according to the overall_rating column.
df.sort_values("overall_rating", ascending=False, inplace=True)
df.head(30)
According to this database, Lionel Messi is the highest ranked (~94 points) football player, even across differnt years, followed by Cristiano Ronaldo, Gianluigi Buffon and Wayne Rooney. In this way however, we cannot further do any visualisations because we have the same players accross multiple rows. We therefore will compute a summary statistic first, that we can then use for visualisations.
stats = df.groupby("player_name")["overall_rating"].agg(["mean", "median", "min", "max", "std"]).sort_values("mean", ascending = False)
stats = stats.reset_index()
stats
top50 = stats[:50] # crop the top 50 player
plt.style.use("seaborn")
plt.figure(figsize=(4,10))
sns.barplot(data = top50, x="mean", y="player_name", palette="mako");
plt.title("Top 50 ranked player");
plt.xlabel("Mean of global rating");
plt.ylabel("Player name");
plt.savefig("top50_players.png")
# extract years from the "date" column and store in "obs_year"
df["obs_year"] = df.date.dt.year
df["obs_year"]
df.obs_year.unique()
# subset the dataframe based on the top 10 names in the top50 dataframe.
evolution = df.loc[df.player_name.isin(top50.player_name[:10]), :]
plt.figure(figsize=(15,4))
sns.barplot(data = evolution, y="overall_rating", x="player_name", hue="obs_year", palette="mako_r");
plt.legend(bbox_to_anchor=(1,1), title="Years");
plt.title("Evolution of the top 10 players in the Soccer Database as rated by their FIFA score");
plt.xlabel("Player Name");
plt.ylabel("Overall Rating [0-100]");
plt.savefig("top10_evolution.png")
As we can see in this barchart, Lionel Messi has largely improved his score from 2007 to 2016, so does Cristiano Ronaldo. Players like Iker Cassillas and Wayne Rooney have however lost some points throughout the years. Others, like Philipp Lahm or Zlatan Ibrahimovic have mostly stayed at the same score throughout the period of observation.
Create a correlation heatmap to get a first overview of player attributes that correlate (EDA), then extract attributes that show a high correlation to the overall player score and examine them further.
plt.figure(figsize=(25,25))
exclude_cols = ["id", "player_fifa_api_id", "player_api_id"]
plot_cols = df.columns.difference(exclude_cols)
fig = sns.heatmap(df[plot_cols].corr(),
cmap="icefire",
annot=True, fmt = ".2f",
linewidths=0.1, linecolor="black");
plt.title("Correlation Heatmap of individual Player Attributes", fontsize=24);
fig.set_xticklabels(fig.get_xmajorticklabels(), fontsize = 16);
fig.set_yticklabels(fig.get_xmajorticklabels(), fontsize = 16);
fig.patch.set_edgecolor("black");
fig.patch.set_linewidth("2")
plt.savefig("correlations.png")
As we can see in this correlation heatmap, some attributes show a very obvious relationshipt. Standing tackle and sliding tackling show a correlation of ~0.95, players that provide good long shots also show high ball control (0.79) and players that are good at dribbling show great ball control too (0.9). We next want to know which of these attributes are most likely to be of utmost importance for achieving great FIFA scores. Therefore create a
corr_dict, this is going to store all variables correlation value when compared to theoverall_ratingcolumn.
corr_dict = {}
o_r = df.overall_rating
for col in df.columns.difference(["overall_rating"]):
try:
corr_dict[col] = o_r.corr(df[col])
except Exception as e:
""
# sort the dictionary reversely and show the top 10
sorted_correls = sorted(corr_dict.items(), key = lambda x: x[1])
sorted_correls[-10:]
Overall, it does not appear that there is one single player characteristic that makes a good player. As we know, players like Cristino Ronaldo or Lionel Messi are very versatile players, they are great at ball control, dribbling, give excellent passes, make fantastic free kicks and penalties. It thus appears, that to be a good player like these two, one needs to achieve maximum performance in both of them. We therefore might need to narrow down our range of observation from an overall rating towards something more specific, like looking at individual attributes alone. In the last section, we will try to find out if theres any pattern of attributes that players need to have in order to achieve a maximum perfomance in
penalties.
penalties ?¶pen_corr = {}
pen = df.penalties
for col in df.columns.difference(["penalties"]):
try:
corr = pen.corr(df[col])
pen_corr[col] = corr
except Exception as e:
""
# sort the dictionary, with increasing correlation values
items = sorted(pen_corr.items(), key = lambda x: x[1])
plot_items = sorted(pen_corr.items(), key = lambda x: x[1])[-10:]
items
plot_items = np.array(plot_items)
cols = plot_items.T[0]
dfs = df.sample(1000, replace=False)
g = sns.PairGrid(data=dfs, vars=cols);
g.map_diag(plt.hist, color="lightblue");
g.map_offdiag(plt.scatter, color="lightgreen");
OneThe best football players in the world are Lionel Messi and Cristiano Ronaldo. Both have even improved throughout the years. This is not surprising as both we're very young at the time recording started. Others like Iker Cassillas are a generation older, and their score however decreased throughout the observation period.
TwoWe cannot really get theoneplayer characteristic that defines a good rated player, as the maximum correlation values we obtain are only around 0.7 or lower. It appears, that a good player needs a mixture of all of these.
ThreeWe next wanted to have a look, what attributes are most important in defining players that perform great penalties. We would excpet naturally that attributes like precision, strength, acceleration would be required. Indeed, attributes like positioning, finishing and long shots are the three most important features. Players that show precision and are able to provide long passes are likely to also score penalties. Volleys are also (0.7) correlated with scoring penalties. Playing good volles requires a lot of ball control and this is most likely to be of benefit also wenn it accounts for penalties.
OutlookNext, one could build some kind of recommendation engine for players. Players could look up their scores, and recieve a recommendation of what they should improve on (for instance penalties). Then the computer could show them the items like positioning finishing and volleys and tell them to spend more time studying these in order to shoot better penalties. It could also look up players that show an extraordinarily high score in each of these features and tell them to analyse these players in more depth in order to learn from them.